Protein Sequences Classification Based on String Weighting Scheme

Authors

  • Nazar M. Zaki
  • Safaai Deris
  • Rosli Illias
Abstract

Motivation: We present a new technique for recognizing remote protein homologies that combines probabilistic modeling with supervised learning in high-dimensional feature spaces. The main novelty of the technique is the way feature vectors are constructed using a Hidden Markov Model, and the combination of this representation with a classifier capable of learning in very sparse high-dimensional spaces. Each feature vector records the sensitivity of a protein domain to a previously learned set of sub-sequences (strings). Unlike previous methods, our method takes into consideration both the conserved and the non-conserved regions within the protein sequences of interest. The system then uses Support Vector Machine (SVM) classifiers to learn the boundaries between structural protein classes.

Results: Experiments show that this method, which we call the String Weighting Scheme-SVM (SWS-SVM) method, significantly improves on previous methods for the classification of protein domains based on remote homologies. Our method is then compared to five existing homology detection methods.

Contact: [email protected], http://www.kp.fsksm.utm.my/Main/research/geno me/zakiall1.html

Supplementary Information: The datasets, models, and results files used in this paper are available online: (http://www.kp.fsksm.utm.my/Main/research/geno me/paper1.html).
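As a rough illustration of the feature-vector idea described in the abstract, the sketch below scores one protein sequence against a fixed set of strings and returns one weight per string. It is not the authors' String Weighting Scheme: the abstract says the weights come from a Hidden Markov Model, whereas this sketch substitutes a much simpler stand-in (the best fraction of matching positions over all windows of the sequence). The function names, the example strings, and the example sequence are all hypothetical. In the full pipeline such vectors would be passed to an SVM classifier, which is not shown here.

```python
from typing import List


def best_window_score(sequence: str, pattern: str) -> float:
    """Best fraction of positions at which `pattern` matches any window of `sequence`.

    Assumption: this simple match fraction is a stand-in for the HMM-derived
    weights mentioned in the abstract; it is not the published scheme.
    """
    k = len(pattern)
    if k == 0 or len(sequence) < k:
        return 0.0
    best = 0
    # Slide a window of length k over the sequence and count exact matches.
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        matches = sum(1 for a, b in zip(window, pattern) if a == b)
        best = max(best, matches)
    return best / k


def feature_vector(sequence: str, strings: List[str]) -> List[float]:
    """One weight per learned string, in the order the strings are given."""
    return [best_window_score(sequence, s) for s in strings]


# Hypothetical learned strings and a hypothetical protein fragment.
learned_strings = ["ACD", "KLM", "WY"]
vec = feature_vector("MACDKLW", learned_strings)
print(vec)
```

Strings that never match contribute a zero, so with a large string set most entries are zero. That is consistent with the abstract's emphasis on a classifier that can learn in very sparse high-dimensional spaces.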

Similar resources

A Novel Scheme for Improving Accuracy of KNN Classification Algorithm Based on the New Weighting Technique and Stepwise Feature Selection

The K nearest neighbor algorithm is one of the most frequently used techniques in data mining for its integrity and performance. Though the KNN algorithm is highly effective in many cases, it has some essential deficiencies, which affect the classification accuracy of the algorithm. First, the effectiveness of the algorithm is affected by redundant and irrelevant features. Furthermore, this algori...

Accuracy of String Kernels for Protein Sequence Classification

Determining protein sequence similarity is an important task for protein classification and homology detection. Typically this may be done using sequence alignment algorithms, yet fast and accurate alignment-free kernel based classifiers exist. Viewing sequences as a “bag of words”, we test a simple weighted string kernel, investigating the effects of k-mer length, sequence length and choice of...

Iterative scheme based on boundary point method for common fixed point of strongly nonexpansive sequences

Let $C$ be a nonempty closed convex subset of a real Hilbert space $H$. Let $\{S_n\}$ and $\{T_n\}$ be sequences of nonexpansive self-mappings of $C$, where one of them is a strongly nonexpansive sequence. K. Aoyama and Y. Kimura introduced the iteration process $x_{n+1}=\beta_n x_n+(1-\beta_n)S_n(\alpha_n u+(1-\alpha_n)T_n x_n)$ for finding the common fixed point of $\{S_n\}$ and $\{T_n\}$, where $u\in C$ is ...

GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION

This paper considers the generation of some interpretable fuzzy rules for assigning an amino acid sequence into the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of rules, we have used the distribution of amino acids in the sequences of proteins as features. These features are the occurrence probabilities of six exchange groups in the seque...

Weighted Symbols-Based Edit Distance for String-Structured Image Classification

As an alternative to vector representations, a recent trend in image classification suggests integrating additional structural information in the description of images in order to enhance classification accuracy. Rather than being represented in a p-dimensional space, images can typically be encoded in the form of strings, trees or graphs and are usually compared either by computing suited met...

Publication date: 2003